Abstract: In this paper, we design an efficient algorithm
for the energy-aware profit maximizing scheduling problem,
in which the administrator of a high-performance computing
system seeks to maximize the profit per unit time. The running
time of the proposed algorithm depends on the number of task
types, whereas the running time of the previous algorithm
depends on the number of tasks. Moreover, we prove that the
worst-case performance ratio is close to 2, which may be the
best possible result. Simulation experiments show that the
proposed algorithm is more accurate than the previous method.
I. Introduction
A. Background and Motivation 

In high-performance computing (HPC) systems, it is well
known that higher performance brings higher power
consumption and, with it, higher electricity costs for
the operators. Recently, the high cost of operating HPC
systems has led to research on efficient resource allocation
algorithms that reduce the required energy consumption [1].
By combining the energy and performance objectives into a
single profit objective, Tarplee et al. [1] introduced a novel
monetary-based model for HPC in which there is a financial
distinction between the service provider and the users. Two
facts about HPC systems are important: (a) they are often
composed of different types of machines; (b) there are a
large number of tasks but only a small number of task types.
By solving a linear program and rounding carefully, they [1]
designed an efficient algorithm to find a feasible schedule.

In [1], a lower bound on the finishing times of a machine
type is used in place of the makespan, which is defined as
the maximum finishing time of all machines. Therefore,
the mathematical model proposed there is inaccurate. In the
rounding process of the algorithm proposed in [1], the energy
consumption may increase, which can be avoided by using a
different method. Moreover, its running time depends on the
number of tasks, which can also be improved. Most
importantly, no worst-case performance ratio is given for
the algorithm proposed in [1].


B. Contributions and Outline 

This paper presents an accurate mathematical model for
the problem proposed in [1]. A polynomial-time algorithm
is then developed to find a feasible solution for the proposed
model.

The contributions of this paper are: 

1) An accurate mathematical model; 

2) A task-type-based algorithm to find a more accurate 
feasible solution, whose running time is independent of the 
number of tasks; 

3) The worst-case performance ratio. 

The remainder of this paper is organized as follows. 
The next section proposes the accurate mathematical model. 
Section III presents the task-type-based algorithm and proves 
the worst-case performance ratio. Section IV gives the 
experimental results. The last section discusses the useful 
extensions to the proposed model and lists ideas for future 
work. 

II. The Mathematical Model 

As in [1], a user submits a bag-of-tasks to process, where
each task is indivisible and independent of all the other tasks.
The cost to the organization for processing a bag-of-tasks is
the cost of electricity. The organization or service provider
should maximize the profit per bag, which is equal to the
price minus the cost. However, the bag-of-tasks can take
a considerable amount of time to compute when trying to
increase the profit by reducing electricity costs. Thus, it is
more reasonable for an organization to maximize the profit
per unit time.

Formally, assume that there are $T$ task types and $M$
machine types. Let $\mathcal{T}_i$ be the set of tasks of type $i$
and $T_i$ be the number of tasks in $\mathcal{T}_i$. Similarly, let
$\mathcal{M}_j$ be the set of machines of type $j$ and $M_j$ be the
number of machines in $\mathcal{M}_j$. Denote by $x_{ij}$ the number
of tasks of type $i$ assigned to machines of type $j$, where
$x_{ij}$ is the primary decision variable in the optimization
problem. Following the definitions frequently used in
scheduling algorithms [1], let $ETC$ be a $T \times M$ matrix where
$ETC_{ij}$ is the estimated time to compute a task of type $i$
on a machine of type $j$. Similarly, let $APC$ be a $T \times M$
matrix where $APC_{ij}$ is the average power consumption of a
task of type $i$ on a machine of type $j$.


Since tasks are indivisible in most cases, the $x_{ij}$ tasks of
type $i$ may not be allocated equally to the $M_j$ machines of
type $j$. For every machine $j_k \in \mathcal{M}_j$, let $x_{ij_k}$ be
the number of tasks of type $i$ assigned to machine $j_k$.
Clearly, $x_{ij} = \sum_{k: j_k \in \mathcal{M}_j} x_{ij_k}$. The finishing
time of a machine $j_k \in \mathcal{M}_j$, denoted by $F_{j_k}$, is
given by

$$F_{j_k} = \sum_{i=1}^{T} x_{ij_k} ETC_{ij}. \quad (1)$$

Thus, the maximum finishing time of all machines (i.e., the
makespan), denoted by $MS(\mathbf{x})$, is given by

$$MS(\mathbf{x}) = \max_{j} \max_{k: j_k \in \mathcal{M}_j} F_{j_k}. \quad (2)$$


In this paper, for convenience, machines are turned off when
not in use, which means that the energy consumed by the
bag-of-tasks is given by

$$E(\mathbf{x}) = \sum_{j=1}^{M} \sum_{i=1}^{T} x_{ij} APC_{ij} ETC_{ij}. \quad (3)$$


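Let $p$ be the price the customer pays and $c$ be the cost per
unit of electrical energy. The profit that the organization
receives by executing a bag-of-tasks is $p - cE(\mathbf{x})$. The
Energy-Aware Profit Maximizing Scheduling (EAPMS) problem
defined in [1], which attempts to maximize the profit per unit
time, can be formulated as the following nonlinear integer
program (NLIP):

$$
\begin{aligned}
\text{Maximize}_{\mathbf{x}} \quad & \frac{p - cE(\mathbf{x})}{MS(\mathbf{x})} \\
\text{subject to} \quad
& \forall i: \ \sum_{j=1}^{M} \sum_{k: j_k \in \mathcal{M}_j} x_{ij_k} = \sum_{j=1}^{M} x_{ij} = T_i, \\
& \forall j: \ F_{j_k} \le MS(\mathbf{x}), \ \text{for each } j_k \in \mathcal{M}_j, \\
& \forall i, j: \ x_{ij_k} \in \mathbb{Z}_{\ge 0}, \ \text{for each } j_k \in \mathcal{M}_j.
\end{aligned}
\quad (4)
$$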


The objective of (4) is to maximize the profit per unit
time, where $\mathbf{x}$ is the primary decision variable. The first
constraint ensures that all tasks of every type in the bag
are assigned to some machine type. Because maximizing the
profit per unit time drives the makespan down, the second
constraint ensures that $MS(\mathbf{x})$ is equal to the maximum
finishing time of all machines.
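To make the quantities in (1)-(4) concrete, the following minimal Python sketch evaluates a given per-machine assignment. The function name `evaluate_schedule` and the nested-list layout `assign[j][k][i]` (tasks of type i on the k-th machine of type j) are illustrative assumptions, not notation from the paper.

```python
from typing import List, Tuple


def evaluate_schedule(
    assign: List[List[List[int]]],  # assign[j][k][i]: tasks of type i on k-th machine of type j (assumed layout)
    etc: List[List[float]],         # etc[i][j]: estimated time to compute
    apc: List[List[float]],         # apc[i][j]: average power consumption
    price: float,                   # p
    cost: float,                    # c
) -> Tuple[float, float, float]:
    """Return (makespan, energy, profit per unit time) of a schedule."""
    makespan = 0.0
    energy = 0.0
    for j, machines in enumerate(assign):
        for counts in machines:
            # Finishing time of one machine, as in (1).
            finish = sum(n * etc[i][j] for i, n in enumerate(counts))
            # Makespan is the largest finishing time, as in (2).
            makespan = max(makespan, finish)
            # Energy adds up power times time over all tasks, as in (3).
            energy += sum(n * apc[i][j] * etc[i][j] for i, n in enumerate(counts))
    if makespan <= 0.0:
        raise ValueError("schedule contains no work")
    # Objective of (4): profit per unit time.
    return makespan, energy, (price - cost * energy) / makespan
```

For example, with two task types, two machines of the first type and one of the second, the call returns the makespan of the slowest machine together with the total energy and the resulting profit rate.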


III. A Task-Type-Based Algorithm 
A. Overview 

Note that (4) is a nonlinear integer program, which in
general is not expected to be solvable optimally in polynomial
time. To obtain an approximate solution of (4), one possible
way is to convert (4) to an equivalent linear program (LP),
and then to round the optimal fractional solution of the LP
to a feasible solution for (4). In [1], the authors obtained a
linear program using the variable substitution
$r \leftarrow 1/MS_{LB}$ and $z_{ij} \leftarrow x_{ij}/MS_{LB}$, where
$MS_{LB} = \max_{j} \frac{1}{M_j} \sum_{i=1}^{T} x_{ij} ETC_{ij}$ is a lower
bound on the makespan obtained by allowing tasks to be
divided among all machines. However, the approximation
quality of this method is poor when the objective value is
close to 0, or when only a few tasks of type $i$ with large
$ETC_{ij}$ are assigned to machines of type $j$. A similar
phenomenon is also observed by Tarplee et al. [1].

To overcome the obstacle mentioned above, we use a
different method. We replace $MS(\mathbf{x})$ with a constant $MS$,
and thereby obtain an approximate integer linear program
(ILP) for (4). By rounding the optimal fractional solution of
the relaxation of the ILP, based on the classic rounding
algorithm for the generalized assignment problem [3], we
obtain a feasible solution for (4). It is worth pointing out
that, in our method, tasks of type $i$ with $ETC_{ij} > MS$ are
never assigned to machines of type $j$, which avoids
increasing the makespan too much when rounding the optimal
fractional solution.

Let $LB$ be the optimal makespan when the energy
consumption is ignored, and let $UB$ be the makespan of the
feasible schedule obtained by assigning each task to the
machine with minimum average power consumption $APC_{ij}$.
For any given constant $\epsilon > 0$, the makespan $MS(\mathbf{x}^*)$ of
the optimal solution $\mathbf{x}^*$ for (4) clearly lies in
$[LB(1+\epsilon)^{t-1}, LB(1+\epsilon)^{t}]$ for some
$t \in \{1, 2, \ldots, \lceil \log_{1+\epsilon} (UB/LB) \rceil\}$. By trying
all possible values, we will find a feasible makespan $MS$
such that $MS(\mathbf{x}^*) \in [MS/(1+\epsilon), MS]$, where
$MS = LB(1+\epsilon)^{t}$ for some $t$. For convenience, from now on,
assume that $MS$ is a known constant satisfying

$$MS(\mathbf{x}^*) \le MS \le (1+\epsilon) MS(\mathbf{x}^*). \quad (5)$$
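The geometric search over candidate makespans can be sketched in a few lines of Python. The function name `candidate_makespans` and its argument names are illustrative assumptions; the sketch presumes that LB and UB have already been computed and that 0 < LB <= UB.

```python
import math
from typing import List


def candidate_makespans(lb: float, ub: float, eps: float) -> List[float]:
    """Return the values LB * (1 + eps)^t for t = 1, ..., ceil(log_{1+eps}(UB / LB))."""
    if lb <= 0 or ub < lb or eps <= 0:
        raise ValueError("need 0 < lb <= ub and eps > 0")
    # At least one candidate is produced even when ub == lb.
    t_max = max(1, math.ceil(math.log(ub / lb) / math.log(1.0 + eps)))
    return [lb * (1.0 + eps) ** t for t in range(1, t_max + 1)]
```

Each candidate is then used as the constant MS; for any true makespan m in [LB, UB], the smallest candidate that is at least m is also at most (1 + eps) m, which is the property stated in (5).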

For a constant $MS$, as in [1], our algorithm is decomposed
into two phases. The first phase rounds the optimal fractional
solution to obtain a schedule in which the numbers $x_{ij}$ of
tasks of type $i$ assigned to machines of type $j$ are given.
The second phase assigns tasks to actual machines to produce
the full task allocation $x_{ij_k}$. The next two subsections
describe the two phases of this procedure in detail.

There are two main differences between the method of
Tarplee, Maciejewski, and Siegel (TMS, for short) [1] and our
task-type-based (TTB, for short) method (depicted in Figure
1): (1) the TMS method rounds a single fractional solution,
whereas we round multiple fractional solutions and choose
the best one; (2) in the first phase, the energy consumption
may increase in the TMS method, whereas it does not
increase in our method.

B. b-Matching-Based Rounding 

Figure 1. Comparing the main ideas of the two algorithms:
(a) TMS method; (b) TTB method. (Axis label: energy.)

Note that if $ETC_{ij} > MS$, the tasks of type $i$ cannot be
assigned to the machines of type $j$ in the optimal solution,
by the definition of $MS$. This implies that
$x_{ij_k} = x_{ij} = 0$ whenever $i, j, k$ satisfy $ETC_{ij} > MS$ and
$j_k \in \mathcal{M}_j$. As mentioned in [1],
$\frac{1}{M_j} \sum_{i=1}^{T} x_{ij} ETC_{ij}$ is a lower bound on $MS$. Since
$MS$ is a constant close to $MS(\mathbf{x}^*)$, we can substitute $MS$
for $MS(\mathbf{x})$ in (4). Since $p$, $MS$, and $c$ are constants,
maximizing the objective
$(p - cE(\mathbf{x}))/MS = p/MS - cE(\mathbf{x})/MS$ is equivalent to
minimizing $E(\mathbf{x})$. Thus, we obtain an approximately
equivalent integer program for NLIP (4):

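$$
\begin{aligned}
\text{Minimize}_{\mathbf{x}} \quad & E(\mathbf{x}) = \sum_{j=1}^{M} \sum_{i=1}^{T} x_{ij} APC_{ij} ETC_{ij} \\
\text{subject to} \quad
& \forall i: \ \sum_{j=1}^{M} x_{ij} = T_i, \\
& \forall j: \ \frac{1}{M_j} \sum_{i=1}^{T} x_{ij} ETC_{ij} \le MS, \\
& x_{ij} \in \mathbb{Z}_{\ge 0}, \ \text{for each } i, j, \\
& x_{ij} = 0, \ \text{if } ETC_{ij} > MS.
\end{aligned}
\quad (6)
$$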


Theorem 1. Any optimal solution $\mathbf{x}^*$ for NLIP (4) is a
feasible solution for (6).

Replacing the constraint $x_{ij} \in \mathbb{Z}_{\ge 0}$ with
$x_{ij} \ge 0$, we obtain the relaxation of (6), which is a linear
program and can be solved in polynomial time. Note that it
has $TM$ variables and $T + M$ nontrivial constraints, both
fewer than in the linear program (10) of [1]. By modifying
Shmoys and Tardos's rounding method in [3], which finds a
minimum-weight matching of an auxiliary bipartite graph
$B(\mathbf{x})$, we can convert a feasible solution $\mathbf{x}$ for the
relaxation of (6) into a feasible solution $\hat{\mathbf{x}}$ for (6). An
important observation is that, when $\mathbf{x}$ is an optimal
solution of the relaxation, $\hat{\mathbf{x}}$ satisfies
$MS(\hat{\mathbf{x}}) \le 2MS$ and $E(\hat{\mathbf{x}}) \le E(\mathbf{x}) \le E(\mathbf{x}^*)$.

Note that the running time of Shmoys and Tardos's rounding
method [3] depends on the number of tasks, which is very
large in practice [1]. To reduce the running time, we replace
minimum-weight matching by minimum-weight b-matching [4]
to design an algorithm whose running time depends on the
number of task types. For completeness, we present the
modified version of Shmoys and Tardos's rounding method in
[3] as follows. Here, for simplicity, we only show how to
construct the bipartite graph $B(\mathbf{x})$ and the edge weights,
ignoring the fractional solution of the matching. Given a
feasible solution $\mathbf{x}$ for the relaxation of (6), let
$x'_{ij} = x_{ij} - \lfloor x_{ij} \rfloor$, for $i = 1, \ldots, T$ and
$j = 1, \ldots, M$.


Construct a weighted bipartite graph $B(\mathbf{x}) = (U, V, E; w)$,
where $U = \{u_1, \ldots, u_T\}$ represents the set of task types.
The other node set
$V = \{v_{js} \mid j = 1, \ldots, M, \ s = 1, \ldots, k_j\}$
consists of machine-type nodes, where
$k_j = \lceil \sum_{i=1}^{T} x'_{ij} \rceil$ and the $k_j$ nodes $v_{js}$,
$s = 1, \ldots, k_j$, correspond to machine type $j$, for
$j = 1, \ldots, M$.

As in [3], the edges in $E$ of the bipartite graph $B(\mathbf{x})$
correspond to task-machine pairs $(i, j)$ such that
$x'_{ij} > 0$. To construct the edges incident to the nodes in $V$
corresponding to machine type $j$, sort the task types in order
of nonincreasing estimated times to compute $ETC_{ij}$. For
simplicity, assume that

$$ETC_{1j} \ge ETC_{2j} \ge \cdots \ge ETC_{Tj}. \quad (7)$$

If $\sum_{i=1}^{T} x'_{ij} \le 1$, then $k_j = 1$, which implies that
there is only one node $v_{j1} \in V$ corresponding to machine
type $j$. For each $x'_{ij} > 0$, include $(v_{j1}, u_i) \in E$.
Otherwise, find the minimum index $i_1$ such that
$\sum_{i=1}^{i_1} x'_{ij} \ge 1$. Let $E$ contain those edges
$(v_{j1}, u_i)$, $i = 1, \ldots, i_1$, for which $x'_{ij} > 0$. For each
$s = 2, \ldots, k_j - 1$, find the minimum index $i_s$ such that
$\sum_{i=1}^{i_s} x'_{ij} \ge s$. Let $E$ contain those edges
$(v_{js}, u_i)$, $i = i_{s-1} + 1, \ldots, i_s$, for which $x'_{ij} > 0$.
If $\sum_{i=1}^{i_s} x'_{ij} > s$ for some $s \in \{1, \ldots, k_j - 1\}$,
then also put edge $(v_{j,s+1}, u_{i_s}) \in E$. Finally, put edges
$(v_{jk_j}, u_i) \in E$, $i = i_{k_j - 1} + 1, \ldots, T$, for which
$x'_{ij} > 0$.
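The edge construction for a single machine type can be sketched as a sequential "slot filling" pass in Python. This is an illustrative reading of the construction above, not the authors' code: the function name `build_slot_edges`, the 1-based slot index, and the tolerance `tol` used to absorb floating-point noise are all assumptions.

```python
import math
from typing import List, Tuple


def build_slot_edges(
    frac: List[float],     # frac[i] = fractional part x'_ij for this machine type j
    etc_col: List[float],  # etc_col[i] = ETC_ij for this machine type j
    tol: float = 1e-9,     # assumed numerical tolerance
) -> Tuple[int, List[Tuple[int, int]]]:
    """Return (k_j, edges), where each edge is (slot s, task type i).

    Slots are numbered from 1 and stand for the nodes v_js; every slot
    absorbs one unit of fractional assignment. Task types are taken in
    nonincreasing ETC order, so a type whose fraction straddles a slot
    boundary receives an edge to both neighbouring slots.
    """
    total = sum(frac)
    k = math.ceil(total - tol) if total > tol else 0
    order = sorted(range(len(frac)), key=lambda i: -etc_col[i])
    edges: List[Tuple[int, int]] = []
    slot, room = 1, 1.0
    for i in order:
        left = frac[i]
        while left > tol:
            edges.append((slot, i))
            used = min(left, room)
            left -= used
            room -= used
            if room <= tol:  # slot is full, open the next one
                slot, room = slot + 1, 1.0
    return k, edges
```

Because every fractional part is below 1, a task type touches at most two consecutive slots, which is where the bound of one or two edges per positive variable comes from.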

For each edge $(v_{js}, u_i) \in E$, let the weight of the edge
be $w(v_{js}, u_i) = APC_{ij} ETC_{ij}$. For each task-type node
$u_i \in U$, let the capacity of $u_i$ be
$b_i = \sum_{j=1}^{M} x'_{ij}$, where $b_i$ is an integer because
$\sum_{j=1}^{M} x'_{ij} = \sum_{j=1}^{M} x_{ij} - \sum_{j=1}^{M} \lfloor x_{ij} \rfloor
= T_i - \sum_{j=1}^{M} \lfloor x_{ij} \rfloor$ is an integer. From the
construction of the bipartite graph $B(\mathbf{x})$, it is easy to
verify that there are at most $T$ nodes in $U$ and at most
$\sum_{j=1}^{M} k_j$ nodes in $V$. As there are $T + M$ nontrivial
constraints in (6), the number of positive variables in a
basic solution $\mathbf{x}$ is at most $T + M$, by a standard
property of linear programming. Since each $x'_{ij} > 0$ gives
rise to one or two edges in $E$, there are at most $2(T + M)$
edges in $E$. Therefore, the minimum-cost b-matching
$B\mathcal{M}$ that matches each task-type node $u_i$ in $B(\mathbf{x})$
exactly $b_i$ times can be found by using the method in [4],
whose running time is polynomial in $T$ and $M$.

The modified Shmoys and Tardos rounding algorithm, which
constructs a schedule $\hat{\mathbf{x}}$ from a feasible solution
$\mathbf{x}$ of the relaxation of (6), is summarized as follows.

Algorithm A

Step 1. Form the bipartite graph $B(\mathbf{x})$ with weights on its
edges as described above.

Step 2. Use the method in [4] to find a minimum-weight
(integer) b-matching $B\mathcal{M}$ that matches each task-type
node $u_i$ in $B(\mathbf{x})$ exactly $b_i$ times.

Step 3. For each edge $(v_{js}, u_i) \in B\mathcal{M}$, assign a task of
type $i$ to a machine of type $j$, which implies that
$\hat{x}_{ij} = \lfloor x_{ij} \rfloor + |\{(v_{js}, u_i) \mid (v_{js}, u_i) \in B\mathcal{M}\}|$,
for every $i, j$.
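Step 2 can be illustrated with a small min-cost-flow formulation in Python. This sketch is a stand-in for the method in [4], not a reproduction of it: it uses successive shortest augmenting paths with Bellman-Ford, and it assumes that each slot node may be matched at most once. The function name `min_cost_b_matching` and the dictionary format of `edges` are hypothetical.

```python
from typing import Dict, List, Tuple


def min_cost_b_matching(
    b: List[int],                          # b[i]: times task type i must be matched
    num_slots: int,                        # slot nodes 0 .. num_slots - 1
    edges: Dict[Tuple[int, int], float],   # (task type i, slot s) -> weight
) -> List[Tuple[int, int]]:
    """Return the chosen (i, s) pairs of a minimum-weight b-matching."""
    n_types = len(b)
    src, snk = n_types + num_slots, n_types + num_slots + 1
    # Each arc is [head, residual capacity, cost, index of reverse arc].
    graph: List[List[list]] = [[] for _ in range(snk + 1)]

    def add_arc(u: int, v: int, cap: int, cost: float) -> None:
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    for i, need in enumerate(b):
        add_arc(src, i, need, 0.0)
    for s in range(num_slots):
        add_arc(n_types + s, snk, 1, 0.0)
    for (i, s), w in edges.items():
        add_arc(i, n_types + s, 1, w)

    inf = float("inf")
    for _unit in range(sum(b)):  # push one unit of flow per iteration
        dist = [inf] * (snk + 1)
        prev = [None] * (snk + 1)
        dist[src] = 0.0
        for _pass in range(snk):  # Bellman-Ford on the residual graph
            changed = False
            for u in range(snk + 1):
                if dist[u] == inf:
                    continue
                for idx, (v, cap, cost, _rev) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, idx)
                        changed = True
            if not changed:
                break
        if prev[snk] is None:
            raise ValueError("no b-matching saturates every task type")
        v = snk
        while v != src:  # augment along the shortest path
            u, idx = prev[v]
            graph[u][idx][1] -= 1
            graph[v][graph[u][idx][3]][1] += 1
            v = u

    chosen = []
    for i in range(n_types):
        for v, cap, _cost, _rev in graph[i]:
            if n_types <= v < src and cap == 0:  # saturated type-to-slot arc
                chosen.append((i, v - n_types))
    return sorted(chosen)
```

The number of nodes and arcs in this flow network depends only on T and M, which mirrors the point that the rounding step does not scale with the number of tasks.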






Theorem 2. [3] The schedule $\hat{\mathbf{x}}$ obtained by Algorithm A
has makespan at most $2MS$, and its energy consumption is at
most that of the optimal solution, $E(\mathbf{x}^*)$.

C. Task-Type-Based Local Assignment 

Recall that a feasible schedule assigns every indivisible
task to a specific machine. The solution $x_{ij}$ obtained
in the last subsection assigns $x_{ij}$ tasks of type $i$ to
machines of type $j$. To obtain a feasible schedule, we need
to schedule the tasks already assigned to each machine
type to specific machines within that group. Within a group
of machines of type $j$, $ETC_{ij}$ and $APC_{ij}$ depend only
on the task type $i$. Thus, the total energy consumed by
machines of type $j$ is $\sum_{i=1}^{T} x_{ij} APC_{ij} ETC_{ij}$, which is
a constant. Therefore, we only need to schedule tasks to
minimize the makespan, which is equivalent to the
multiprocessor scheduling problem 0. Tarplee et al. [1] use the
common longest processing time (LPT) algorithm to assign
tasks to machines for each machine type, where the
$\sum_{i=1}^{T} x_{ij}$ tasks are sorted in descending order of
execution time, and each task is assigned to the machine that
will complete it earliest.

As shown in [1], the effect of the sub-optimality of the LPT
algorithm on the overall performance of the systems
considered is insignificant, as the number of tasks is
generally large. However, this leads to another problem: the
running time of the LPT algorithm increases dramatically as
the number of tasks grows. Note that in an HPC system, the
number of task types is always much smaller than the number
of tasks. For example, in the simulations of [1], there are
30 task types but 11,000 tasks. An important observation is
that we do not need to assign one task at a time when
assigning tasks of the same type.

Each group of machines of type $j$ is processed
independently. The task types are sorted in descending order
of execution time $ETC_{ij}$, which can be done in
$O(T \log T)$ time. Without loss of generality, assume
$ETC_{1j} \ge \cdots \ge ETC_{Tj}$. For each machine
$j_k \in \mathcal{M}_j$, let $L_k^i$ be the current load of machine $j_k$
after assigning the tasks of type $i$, where the load of
machine $j_k$ is the total processing time of the tasks
assigned to it. Initially, $L_k^0 = 0$ for each
$j_k \in \mathcal{M}_j$. Let $AL_i$ be the average load of the machines
of type $j$ after assigning the tasks of type $i$, where

$$AL_i = \frac{\sum_{k: j_k \in \mathcal{M}_j} L_k^{i-1} + ETC_{ij} x_{ij}}{M_j}. \quad (8)$$

For $k = 1, \ldots, M_j$, assuming there are $N_{unassign}$
unassigned tasks, schedule $\min\{N_{unassign}, N_k^i\}$ tasks of
type $i$ simultaneously to machine $j_k$, where

$$N_k^i = \max\left\{ \left\lfloor \frac{AL_i - L_k^{i-1}}{ETC_{ij}} \right\rfloor, 0 \right\}. \quad (9)$$

If the load of a machine $j_k$ is increased, meaning
$N_k^i > 0$, we have

$$AL_i - ETC_{ij} < L_k^i = L_k^{i-1} + N_k^i ETC_{ij} \le AL_i. \quad (10)$$

Consequently, at most $M_j$ tasks of type $i$ remain
unassigned, and these can be assigned using the LPT
algorithm. It is easy to verify that our method is equivalent
to the LPT algorithm in [1]. However, the running time is
reduced to $O(\sum_{j=1}^{M} (T \log T + T M_j))$, which does not
depend on the number of tasks, always a huge number in an
HPC system.

Algorithm B shows the pseudo-code for assigning tasks
to machines for each machine type.

Algorithm B. Assign tasks to machines for each type.

For j = 1 to M do
    Relabel the indices such that ETC_{1j} >= ... >= ETC_{Tj};
    For i = 1 to T do
        For each machine j_k in M_j do
            Assign N_k^i (defined in (9)) tasks of type i to it,
            if there are unassigned tasks;
        End for
        Use the LPT algorithm to assign the remaining tasks
        of type i (at most M_j);
    End for
End for
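For one machine type, the inner loops of Algorithm B can be sketched in Python as follows. The function name `assign_type_batched`, the list-based data layout, and the use of a binary heap for the leftover LPT step are illustrative assumptions rather than details given in the paper; all ETC values are assumed to be positive.

```python
import heapq
from typing import List, Tuple


def assign_type_batched(
    counts: List[int],      # counts[i] = x_ij, tasks of type i given to this machine type
    etc_col: List[float],   # etc_col[i] = ETC_ij > 0 for this machine type
    num_machines: int,      # M_j
) -> Tuple[List[float], List[List[int]]]:
    """Return (loads, alloc) where alloc[k][i] is the number of type-i tasks on machine k."""
    order = sorted(range(len(counts)), key=lambda i: -etc_col[i])
    loads = [0.0] * num_machines
    alloc = [[0] * len(counts) for _ in range(num_machines)]
    for i in order:
        remaining = counts[i]
        if remaining == 0:
            continue
        # Average load after placing all tasks of type i, as in (8).
        avg = (sum(loads) + etc_col[i] * remaining) / num_machines
        for k in range(num_machines):
            # Batch size from (9), capped by what is still unassigned.
            n = max(int((avg - loads[k]) // etc_col[i]), 0)
            n = min(n, remaining)
            alloc[k][i] += n
            loads[k] += n * etc_col[i]
            remaining -= n
        # Leftover tasks go one at a time to the least-loaded machine.
        heap = [(loads[k], k) for k in range(num_machines)]
        heapq.heapify(heap)
        for _task in range(remaining):
            load, k = heapq.heappop(heap)
            alloc[k][i] += 1
            loads[k] = load + etc_col[i]
            heapq.heappush(heap, (loads[k], k))
    return loads, alloc
```

On a small instance with five tasks of length 3 and three tasks of length 2 on two machines, the batched pass yields loads of 11 and 10, the same as assigning the eight tasks one by one in LPT order, while touching each (type, machine) pair only a constant number of times.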


D. Performance Analysis 

In summary, for each
$t \in \{1, \ldots, \lceil \log_{1+\epsilon} (UB/LB) \rceil\}$, let
$MS = LB(1+\epsilon)^{t}$. Then, use Algorithm A and
Algorithm B to find a feasible solution for (4). Among
these solutions (at most $\lceil \log_{1+\epsilon} (UB/LB) \rceil$ of them),
choose the one with maximum profit per unit time. It is easy
to verify that the total running time is independent of the
number of tasks.

For a maximization problem, if an algorithm can produce a
feasible solution with objective value at least $OPT/\rho$ for
any instance, where $OPT$ denotes the optimal value, then
$\rho$ is called the worst-case performance ratio or
approximation ratio of the algorithm.

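Combining (5) and Theorem 2, the objective value of the
schedule $\hat{\mathbf{x}}$ produced by the algorithm is at least

$$
\frac{p - cE(\hat{\mathbf{x}})}{2MS}
\ge \frac{p - cE(\mathbf{x}^*)}{2MS}
\ge \frac{p - cE(\mathbf{x}^*)}{2(1+\epsilon) MS(\mathbf{x}^*)}
\ge \frac{1}{2 + 2\epsilon} OPT.
$$

This implies that the worst-case performance ratio of the
proposed algorithm is $2 + 2\epsilon$, for any $\epsilon > 0$.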
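The chain of inequalities above can be read step by step as follows. This is a sketch that assumes $p - cE(\mathbf{x}^*) \ge 0$ (an assumption not stated in the text; without it, enlarging the denominator would not decrease the ratio), writes $\hat{\mathbf{x}}$ for the schedule returned by the algorithm, and uses $OPT = (p - cE(\mathbf{x}^*))/MS(\mathbf{x}^*)$.

```latex
\begin{align*}
\frac{p - cE(\hat{\mathbf{x}})}{MS(\hat{\mathbf{x}})}
  &\ge \frac{p - cE(\hat{\mathbf{x}})}{2\,MS}
  && \text{Theorem 2: } MS(\hat{\mathbf{x}}) \le 2\,MS \\
  &\ge \frac{p - cE(\mathbf{x}^*)}{2\,MS}
  && \text{Theorem 2: } E(\hat{\mathbf{x}}) \le E(\mathbf{x}^*) \\
  &\ge \frac{p - cE(\mathbf{x}^*)}{2(1+\epsilon)\,MS(\mathbf{x}^*)}
  && \text{by (5): } MS \le (1+\epsilon)\,MS(\mathbf{x}^*) \\
  &= \frac{1}{2+2\epsilon}\,OPT
  && \text{definition of } OPT
\end{align*}
```

The first step also uses $p - cE(\hat{\mathbf{x}}) \ge 0$, which follows from the stated assumption together with $E(\hat{\mathbf{x}}) \le E(\mathbf{x}^*)$.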


IV. Experimental Results 

Simulation experiments were performed to compare the
quality of the TMS and TTB methods. As in [1], the software
was written in C++ and the LP solver used the simplex
method from COIN-OR CLP gj.

Without loss of generality, assume that $c = 1$ for all the
experiments. As in [1], let $E_{min}$ be the lower bound on the
minimum energy consumed when ignoring makespan, and let
$p = \gamma E_{min}$, where $\gamma = p/E_{min}$ is a parameter that is
used to affect the price per bag. Clearly, when $\gamma$ is large
enough, the focus is to minimize the makespan [1]. Thus,
we only consider the case $\gamma \in [1, 1.5]$.
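As an illustration of how the price can be tied to the energy bound, the Python sketch below computes one natural candidate for $E_{min}$: every task is charged the energy of its cheapest machine type, ignoring makespan. Whether this matches the exact bound used in [1] is an assumption, and the function names `min_energy_bound` and `price_for` are hypothetical.

```python
from typing import List


def min_energy_bound(
    task_counts: List[int],   # task_counts[i] = T_i
    etc: List[List[float]],   # etc[i][j]: estimated time to compute
    apc: List[List[float]],   # apc[i][j]: average power consumption
) -> float:
    """Energy if every task runs on its lowest-energy machine type (assumed form of E_min)."""
    return sum(
        n * min(a * e for a, e in zip(apc[i], etc[i]))
        for i, n in enumerate(task_counts)
    )


def price_for(gamma: float, task_counts: List[int],
              etc: List[List[float]], apc: List[List[float]]) -> float:
    """Price per bag p = gamma * E_min."""
    return gamma * min_energy_bound(task_counts, etc, apc)
```

With $c = 1$ and $\gamma = 1$, the price equals this energy bound, so a schedule is profitable only if its energy does not exceed the bound.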

For all the simulations, there are nine machine types
and 40 machines of each type, for a total of 360 machines,
as in [1]. Our first experiment is based on a benchmark
G) with nine machine types and five task types, where
the missing values are deleted. The workload consists of
12,000 tasks divided among the 5 task types. As $\gamma$
varies, the solutions produced by the TMS and TTB
methods are shown in Table 1. The table shows that every
solution produced by the TTB method is better than that
produced by the TMS method. In particular, when $\gamma = 1$,
because the rounding step of the TMS method increases
the energy consumption, the TMS method produces
a solution with negative objective value, while the TTB
method produces the optimal solution.


$\gamma$ =    1      1.1      1.2      1.3      1.4      1.5
TMS        -0.6    985.1   1998.8   3505.4   5491.4   7933.9
TTB         0.0    986.1   2009.0   3529.8   5510.7   7986.0
Figure 2. 25 randomized experiments


Although experiments show that the solution produced
by the TTB algorithm is close to the optimal solution,
this does not hold in a worst-case scenario. It is interesting
and challenging to design a polynomial-time algorithm with
worst-case performance ratio less than 2.
Table 1. The solutions with $\gamma$ varying from 1 to 1.5

Since $ETC_{ij}$ and $APC_{ij}$ differ only slightly in the
benchmark 13, to quantify the quality of the solutions in a
more general case, we ran 25 experiments in which $ETC_{ij}$
and $APC_{ij}$ are random numbers between 0 and 1. In the
$q$-th experiment, $q = 1, \ldots, 25$, the workload consists of
$150q$ tasks divided among 30 task types. Figure 2 shows the
profit per unit time obtained by the TMS and TTB methods
when $\gamma = 1.2$. The figure shows that every solution
produced by the TTB method has a higher profit per unit
time. When the number of tasks is large enough, the solutions
produced by the two methods are close to each other.

In fact, in every experiment we have run in which $\gamma$ is
also a random number, the TTB method produces a
higher-quality solution. Moreover, in (6), letting $MS$ be the
makespan of the solution produced by the TMS method, we
can obtain a better solution by using the b-matching-based
rounding and the task-type-based local assignment method of
Section III. It is worth mentioning that the TTB method
performs much better when $\gamma$ is small or the average number
of tasks per machine is small.

V. Discussion and Future Work

With small modifications, our algorithm can be extended
to account for idle power consumption, or to the case where
there is an upper bound on the allowed power consumption,
both of which are considered in [1]. Due to space constraints,
we omit the details here.